
Conversation

@wirthual

Work for #3430

Took the method 1:1 from the OpenAI cookbook. However, the example only covers certain models. How should other models be handled? A quick test with gpt-5 showed a token count that differed from this method.

gpt-5
Warning: gpt-5 may update over time. Returning num tokens assuming gpt-5-2025-08-07.
110 prompt tokens counted by num_tokens_from_messages().
109 prompt tokens counted by the OpenAI API.
Test script
from openai import OpenAI
import os
import tiktoken

def num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using o200k_base encoding.")
        encoding = tiktoken.get_encoding("o200k_base")
    if model in {
        "gpt-3.5-turbo-0125",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        "gpt-4o-mini-2024-07-18",
        "gpt-4o-2024-08-06",
        "gpt-4.1-2025-04-14",
        "gpt-5-2025-08-07",
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0125")
    elif "gpt-4o-mini" in model:
        print("Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.")
        return num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.")
        return num_tokens_from_messages(messages, model="gpt-4o-2024-08-06")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    elif "gpt-5" in model:
        print("Warning: gpt-5 may update over time. Returning num tokens assuming gpt-5-2025-08-07.")
        return num_tokens_from_messages(messages, model="gpt-5-2025-08-07")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

example_messages = [
    {
        "role": "system",
        "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",
    },
    {
        "role": "system",
        "content": "New synergies will help drive top-line growth.",
    },
    {
        "role": "system",
        "content": "Things working well together will increase revenue.",
    },
    {
        "role": "system",
        "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
    },
    {
        "role": "system",
        "content": "Let's talk later when we're less busy about how to do better.",
    },
    {
        "role": "user",
        "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
    },
]

for model in [
    "gpt-3.5-turbo",
    "gpt-4",
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-5"
    ]:
    print(model)
    # example token count from the function defined above
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # example token count from the OpenAI API
    response = client.chat.completions.create(model=model, messages=example_messages)
    print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')
    print()

return event_loop


def num_tokens_from_messages(
Collaborator

This is OpenAI-specific, so it should live in models/openai.py


def num_tokens_from_messages(
    messages: list[ChatCompletionMessageParam] | list[ResponseInputItemParam],
    model: OpenAIModelName = 'gpt-4o-mini-2024-07-18',
Collaborator

We don't need a default value

else:
    raise NotImplementedError(
        f"""num_tokens_from_messages() is not implemented for model {model}."""
    )  # TODO: How to handle other models?
Collaborator

Are you able to reverse engineer the right formula for gpt-5?

As long as we document that this is a best-effort calculation and may not be accurate down to the exact token, we can have one branch of logic for everything before gpt-5 and one for everything newer. If future models have different rules, we can update the logic then.

Author

I think the calculation for gpt-5 is more accurate with a decreased final primer.

Should the method from the cookbook be the default for all other models?

try:
    encoding = tiktoken.encoding_for_model(model)
except KeyError:
    print('Warning: model not found. Using o200k_base encoding.')  # TODO: How to handle warnings?
Collaborator

No warnings please, let's just make a best effort
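A minimal sketch of that best-effort fallback, with _get_encoding as a placeholder name:

import tiktoken


def _get_encoding(model: str) -> tiktoken.Encoding:
    # Best effort: silently fall back to o200k_base for models tiktoken doesn't know.
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding('o200k_base')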

)


def num_tokens_from_messages(
Collaborator

Please make this a private function

Author

Added the _ prefix.

elif 'gpt-5' in model:
    return num_tokens_from_messages(messages, model='gpt-5-2025-08-07')
else:
    raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}.""")
Collaborator

Let's simplify all of this as if 'gpt-5' in model: <do the new thing> else: <do the old thing>

Author

So this is executed for all other models except gpt-5?

if 'gpt-5' in model:
    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>

Or only for the models explicitly listed in the cookbook, raising an error for the others?

    'gpt-4o-2024-08-06',
}:
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>
Collaborator

Let's include a link to the doc we took this from

Author

Link added.

    'gpt-5-2025-08-07',
}:
    tokens_per_message = 3
    final_primer = 2
Collaborator

Let's make it explicit that this one was "reverse engineered"

Author

Added comment

Collaborator

Please test the entire exception message as here:

async def test_anthropic_model_usage_limit_exceeded(
    allow_model_requests: None,
    anthropic_api_key: str,
):
    model = AnthropicModel('claude-sonnet-4-5', provider=AnthropicProvider(api_key=anthropic_api_key))
    agent = Agent(model=model)
    with pytest.raises(
        UsageLimitExceeded,
        match='The next request would exceed the input_tokens_limit of 18 \\(input_tokens=19\\)',
    ):
        await agent.run(
            'The quick brown fox jumps over the lazydog.',
            usage_limits=UsageLimits(input_tokens_limit=18, count_tokens_before_request=True),
        )

Collaborator

This is incorrect :)

Collaborator

We should also have a test for a system other than OpenAI, right? You can get that by building an OpenAIChatModel with a different provider, like OllamaProvider for example.
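A rough sketch of such a test, assuming OllamaProvider is importable from pydantic_ai.providers.ollama and takes a base_url; the model name, limits, and test name are arbitrary:

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider
from pydantic_ai.usage import UsageLimits


async def test_ollama_count_tokens_before_request(allow_model_requests: None):
    # Ollama speaks the OpenAI-compatible API, so this goes through OpenAIChatModel
    # and should exercise the same tiktoken-based counting path as the OpenAI tests.
    model = OpenAIChatModel('llama3.2', provider=OllamaProvider(base_url='http://localhost:11434/v1'))
    agent = Agent(model=model)
    await agent.run(
        'The quick brown fox jumps over the lazy dog.',
        usage_limits=UsageLimits(input_tokens_limit=100, count_tokens_before_request=True),
    )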

    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
Collaborator

Looking at the cookbook again, I think we should also try to implement support for counting the tokens of tool definitions:

https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#7-counting-tokens-for-chat-completions-with-tool-calls
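The cookbook derives exact per-field constants for tool definitions; until those are ported, a very rough placeholder could just encode the serialized schema (the fixed overhead of 7 below is a guess, not taken from the cookbook):

import json


def _rough_num_tokens_for_tools(tools: list[dict], encoding) -> int:
    # Rough approximation: encode each serialized tool definition plus a guessed
    # per-tool overhead, to be verified against real usage data from the API.
    num_tokens = 0
    for tool in tools:
        num_tokens += len(encoding.encode(json.dumps(tool)))
        num_tokens += 7  # guessed fixed overhead per tool definition
    return num_tokens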

Collaborator

Instead of testing against our own hard-coded values, can we compare against the real token usage data returned by the API (that will be recorded in cassettes)?

I'd also like to see tests with more complicated message structures, e.g. with tool calls
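For example, a test along those lines might look like this; example_messages and _num_tokens_from_messages are placeholder names, and the anyio marker is an assumption about the test setup:

import pytest
from openai import AsyncOpenAI


@pytest.mark.anyio
async def test_count_tokens_matches_api_usage(allow_model_requests: None, openai_api_key: str):
    # The request/response pair is recorded in a cassette, so the API-reported
    # prompt_tokens becomes the ground truth for our estimate.
    client = AsyncOpenAI(api_key=openai_api_key)
    response = await client.chat.completions.create(model='gpt-4o-mini', messages=example_messages)
    estimated = _num_tokens_from_messages(example_messages, model='gpt-4o-mini')
    assert abs(estimated - response.usage.prompt_tokens) <= 2  # best-effort tolerance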

Author

Great idea.

Collaborator

Why are these files so large?

100,286 additions

Can we remove them?

Author

The tiktoken library downloads the necessary encoding files, e.g. this one. Usually that's cached in TIKTOKEN_CACHE_DIR.

To allow the tests to run independently, the cassette for each test currently stores this file.

Is there a better way to handle these MB-sized files? Maybe put them in the cache in a fixture?

Collaborator

Ah that makes sense... What do you think about mocking the tiktoken calls, so we can keep these files out of the repo?
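A sketch of what that mocking could look like with a monkeypatch fixture; the fixture name and the fake tokenization rule are arbitrary, so tests would assert against deterministic fake counts rather than real tokenization:

import pytest
import tiktoken


class _FakeEncoding:
    # Stand-in for tiktoken.Encoding: one token per whitespace-separated word.
    def encode(self, text: str) -> list[int]:
        return [0] * len(text.split())


@pytest.fixture
def fake_tiktoken(monkeypatch: pytest.MonkeyPatch):
    fake = _FakeEncoding()
    # Patch both lookup paths so no encoding file is ever downloaded.
    monkeypatch.setattr(tiktoken, 'encoding_for_model', lambda model: fake)
    monkeypatch.setattr(tiktoken, 'get_encoding', lambda name: fake)
    return fake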

num_tokens = 0
for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
Collaborator

This is a bit weird, as it assumes every string value in the message dict will be sent to the model. That may be the case for ChatCompletionMessageParam, but not for ResponseInputItemParam, which is a union that includes things like Message:

class Message(TypedDict, total=False):
    content: Required[ResponseInputMessageContentListParam]
    """
    A list of one or many input items to the model, containing different content
    types.
    """

    role: Required[Literal["user", "system", "developer"]]
    """The role of the message input. One of `user`, `system`, or `developer`."""

    status: Literal["in_progress", "completed", "incomplete"]
    """The status of item.

    One of `in_progress`, `completed`, or `incomplete`. Populated when items are
    returned via API.
    """

    type: Literal["message"]
    """The type of the message input. Always set to `message`."""

I don't think those status and type fields end up with the model. But it'd be worth verifying by comparing our calculation with real data from the API, as I suggested below in the tests.
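If that's confirmed, one option is to count only a whitelist of keys rather than every string value; a sketch, with the key set being a guess to verify against real usage data:

def _count_message_tokens(messages, encoding, tokens_per_message: int) -> int:
    counted_keys = {'role', 'content', 'name'}  # guess at the fields actually sent to the model
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            if key not in counted_keys:
                continue  # skip bookkeeping fields like 'status' and 'type'
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
    return num_tokens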

for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
        if isinstance(value, str):
Collaborator

We also don't currently handle lists of strings properly, for example ChatCompletionMessageParam can be ChatCompletionUserMessageParam:

class ChatCompletionUserMessageParam(TypedDict, total=False):
    content: Required[Union[str, Iterable[ChatCompletionContentPartParam]]]
    """The contents of the user message."""

    role: Required[Literal["user"]]
    """The role of the messages author, in this case `user`."""

    name: str
    """An optional name for the participant.

    Provides the model information to differentiate between participants of the same
    role.
    """

content may just be a str, but could also be a list of ChatCompletionContentPartTextParam:

class ChatCompletionContentPartTextParam(TypedDict, total=False):
    text: Required[str]
    """The text content."""

    type: Required[Literal["text"]]
    """The type of the content part."""

We shouldn't exclude that text from the count.

Same for ResponseInputItemParam, which can have text hidden inside lists.

Unfortunately OpenAI makes it very hard for us to calculate this stuff correctly, but I'd rather have no count_tokens method than one that only works in very specific unrealistic scenarios -- most users are going to have more complicated message histories than the ones we're currently accounting for. So I think we should either implement some smarter behavior (and verify in the tests that it works!), or "give up". Let me know if you're up for the challenge :)
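For what it's worth, the smarter behavior could recursively walk the payload and encode every text-bearing field instead of only top-level strings; the set of keys checked below is an assumption based on the OpenAI param types quoted above, not a verified list:

def _iter_texts(value):
    # Yield every piece of text we believe reaches the model, wherever it is nested.
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from _iter_texts(item)
    elif isinstance(value, dict):
        for key in ('role', 'content', 'text', 'name', 'arguments', 'tool_calls', 'function'):
            if key in value:
                yield from _iter_texts(value[key])


def _estimate_message_tokens(messages, encoding, tokens_per_message: int) -> int:
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for text in _iter_texts(message):
            num_tokens += len(encoding.encode(text))
    return num_tokens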

Author

Yes, I am up for trying to count the tokens for more complicated histories. It seems there are quite a few possible inputs based on your comment.

Are there any test cases representing a more complicated structure that I could use as a starting point?

Collaborator

@wirthual Not specifically, but if you look at the types in the ModelRequest.parts, UserPromptPart.content and ModelResponse.parts type unions, it's pretty easy to (have AI) build one of each
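A sketch of such a one-of-each history, using the part classes from pydantic_ai.messages (constructor arguments may need adjusting):

from pydantic_ai.messages import (
    ModelRequest,
    ModelResponse,
    SystemPromptPart,
    TextPart,
    ToolCallPart,
    ToolReturnPart,
    UserPromptPart,
)

# One request/response round trip with a tool call, then a final text answer.
complicated_history = [
    ModelRequest(parts=[
        SystemPromptPart(content='You are a weather assistant.'),
        UserPromptPart(content='What is the weather in Utrecht?'),
    ]),
    ModelResponse(parts=[
        ToolCallPart(tool_name='get_weather', args={'city': 'Utrecht'}, tool_call_id='call_1'),
    ]),
    ModelRequest(parts=[
        ToolReturnPart(tool_name='get_weather', content='Sunny, 21C', tool_call_id='call_1'),
    ]),
    ModelResponse(parts=[
        TextPart(content='It is sunny and 21C in Utrecht.'),
    ]),
]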

Collaborator

@wirthual Just found https://github.com/pamelafox/openai-messages-token-helper which may be worth using or looking at for inspiration

@DouweM changed the title from "WIP: Add count_tokens for openAI models" to "Implement OpenAI token counting using tiktoken" on Nov 21, 2025
    num_tokens += tokens_per_message
    for value in message.values():
        if isinstance(value, str):
            num_tokens += len(encoding.encode(value))
Collaborator

Since this (or the get_encoding call further up?) could download a large file, and tiktoken is sync, not async, we should wrap the call that may do a download in _utils.run_in_executor to run it in a thread.
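For illustration, a sketch of the idea using the standard-library asyncio.to_thread in place of the repo's _utils.run_in_executor helper; the async wrapper name is arbitrary:

import asyncio

import tiktoken


async def _get_encoding_async(model: str) -> tiktoken.Encoding:
    # encoding_for_model / get_encoding may download an encoding file on first use,
    # so run the sync lookup in a worker thread instead of blocking the event loop.
    def _lookup() -> tiktoken.Encoding:
        try:
            return tiktoken.encoding_for_model(model)
        except KeyError:
            return tiktoken.get_encoding('o200k_base')

    return await asyncio.to_thread(_lookup)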
